logo

Motivation

Sports analytics: collection of relevant, historical, statistics that can provide a competitive advantage to a team or individual.

In this case study, we will analyze data from NBA and try to answer these questions

# delete everything
rm(list=ls()) 

library(tidyverse)
library(GGally) # ggplot2-based visualization of correlations
library(factoextra) # ggplot2-based visualization of pca

Load and prepare data

Currently, the NBA is composed of 30 teams (29 in the United States and 1 in Canada) with more than 500 players

Data from last season: https://www.basketball-reference.com/leagues/NBA_2022_per_game.html

The dataset contains skill performance of all the players (605) during the last season 2020-21

The glossary is in the web page. Performance is per game

players.df = read.csv(file = "NBA_2022.csv")

glimpse(players.df)
## Rows: 812
## Columns: 30
## $ Rk     <int> 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 10, 10, 11, 12, 13, 14, 15…
## $ Player <chr> "Precious Achiuwa\\achiupr01", "Steven Adams\\adamsst01", "Bam …
## $ Pos    <chr> "C", "C", "C", "PF", "C", "SG", "SG", "SG", "SG", "C", "PG", "S…
## $ Age    <int> 22, 28, 24, 21, 36, 23, 23, 23, 26, 23, 23, 28, 28, 28, 28, 27,…
## $ Tm     <chr> "TOR", "MEM", "MIA", "MEM", "BRK", "TOT", "NOP", "UTA", "MIL", …
## $ G      <int> 73, 76, 56, 32, 47, 65, 50, 15, 66, 56, 54, 16, 3, 13, 69, 67, …
## $ GS     <int> 28, 75, 56, 0, 12, 21, 19, 2, 61, 56, 1, 6, 0, 6, 11, 67, 6, 3,…
## $ MP     <dbl> 23.6, 26.3, 32.6, 11.3, 22.3, 22.6, 26.3, 9.9, 27.3, 32.3, 15.4…
## $ FG     <dbl> 3.6, 2.8, 7.3, 1.7, 5.4, 3.9, 4.7, 1.1, 3.9, 6.6, 2.4, 2.3, 1.3…
## $ FGA    <dbl> 8.3, 5.1, 13.0, 4.1, 9.7, 10.5, 12.6, 3.2, 8.6, 9.7, 5.4, 5.9, …
## $ FG.    <dbl> 0.439, 0.547, 0.557, 0.402, 0.550, 0.372, 0.375, 0.333, 0.448, …
## $ X3P    <dbl> 0.8, 0.0, 0.0, 0.2, 0.3, 1.6, 1.9, 0.7, 2.4, 0.0, 0.6, 0.9, 0.7…
## $ X3PA   <dbl> 2.1, 0.0, 0.1, 1.5, 1.0, 5.2, 6.1, 2.2, 5.9, 0.2, 2.0, 3.7, 2.0…
## $ X3P.   <dbl> 0.359, 0.000, 0.000, 0.125, 0.304, 0.311, 0.311, 0.303, 0.409, …
## $ X2P    <dbl> 2.9, 2.8, 7.3, 1.5, 5.1, 2.3, 2.8, 0.4, 1.5, 6.6, 1.8, 1.3, 0.7…
## $ X2PA   <dbl> 6.1, 5.0, 12.9, 2.6, 8.8, 5.3, 6.5, 1.0, 2.7, 9.6, 3.4, 2.3, 0.…
## $ X2P.   <dbl> 0.468, 0.548, 0.562, 0.560, 0.578, 0.433, 0.434, 0.400, 0.533, …
## $ eFG.   <dbl> 0.486, 0.547, 0.557, 0.424, 0.566, 0.449, 0.450, 0.438, 0.588, …
## $ FT     <dbl> 1.1, 1.4, 4.6, 0.6, 1.9, 1.2, 1.4, 0.7, 1.0, 2.9, 0.7, 0.9, 1.0…
## $ FTA    <dbl> 1.8, 2.6, 6.1, 1.0, 2.2, 1.7, 1.9, 0.8, 1.1, 4.2, 1.0, 1.2, 1.3…
## $ FT.    <dbl> 0.595, 0.543, 0.753, 0.625, 0.873, 0.743, 0.722, 0.917, 0.865, …
## $ ORB    <dbl> 2.0, 4.6, 2.4, 1.0, 1.6, 0.6, 0.7, 0.1, 0.5, 3.4, 0.5, 0.3, 0.0…
## $ DRB    <dbl> 4.5, 5.4, 7.6, 1.7, 3.9, 2.3, 2.6, 1.5, 2.9, 7.3, 1.4, 2.6, 2.0…
## $ TRB    <dbl> 6.5, 10.0, 10.1, 2.7, 5.5, 2.9, 3.3, 1.5, 3.4, 10.8, 1.9, 2.9, …
## $ AST    <dbl> 1.1, 3.4, 3.4, 0.7, 0.9, 2.4, 2.8, 1.1, 1.5, 1.6, 2.8, 2.1, 2.0…
## $ STL    <dbl> 0.5, 0.9, 1.4, 0.2, 0.3, 0.7, 0.8, 0.3, 0.7, 0.8, 1.3, 0.5, 0.3…
## $ BLK    <dbl> 0.6, 0.8, 0.8, 0.3, 1.0, 0.4, 0.4, 0.3, 0.3, 1.3, 0.1, 0.4, 0.0…
## $ TOV    <dbl> 1.2, 1.5, 2.6, 0.5, 0.9, 1.4, 1.7, 0.5, 0.7, 1.7, 0.7, 0.5, 0.7…
## $ PF     <dbl> 2.1, 2.0, 3.1, 1.1, 1.7, 1.6, 1.8, 1.0, 1.5, 1.7, 1.4, 1.4, 1.3…
## $ PTS    <dbl> 9.1, 6.9, 19.1, 4.1, 12.9, 10.6, 12.8, 3.5, 11.1, 16.1, 6.1, 6.…

Missing values

hist(rowMeans(is.na(players.df)))

barplot(colMeans(is.na(players.df)), las=2)

Conclusions? What can we do?

Some feature engineering

We are going to skip the following variables:

  • Rk: Rank

  • Player: Player’s name

  • Pos: Position

  • Age: Player’s age on February 1 of the season

  • Tm: Team

  • G and GS: Games played and started (they depend on the coach, not the player’s skills)

  • MP: Minutes played per game (they depend on the coach, not the player’s skills)

nba <- players.df[, 9:ncol(players.df)]

names = players.df[, 2]
names = names %>% str_replace("\\\\.*$", "")  
pos = players.df[,3]
team = players.df[,5]
gp = players.df[,6]
min = players.df[,8]
points = players.df[,30]

dim(nba)
## [1] 812  22
summary(nba)
##        FG             FGA              FG.              X3P        
##  Min.   : 0.00   Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 1.20   1st Qu.: 3.000   1st Qu.:0.3850   1st Qu.:0.2000  
##  Median : 2.40   Median : 5.150   Median :0.4410   Median :0.7000  
##  Mean   : 2.87   Mean   : 6.387   Mean   :0.4343   Mean   :0.8713  
##  3rd Qu.: 3.90   3rd Qu.: 8.725   3rd Qu.:0.5000   3rd Qu.:1.4000  
##  Max.   :11.40   Max.   :21.800   Max.   :1.0000   Max.   :4.5000  
##                                   NA's   :15                       
##       X3PA             X3P.             X2P           X2PA       
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0   Min.   : 0.000  
##  1st Qu.: 0.800   1st Qu.:0.2587   1st Qu.:0.7   1st Qu.: 1.400  
##  Median : 2.050   Median :0.3310   Median :1.5   Median : 3.000  
##  Mean   : 2.561   Mean   :0.3034   Mean   :2.0   Mean   : 3.829  
##  3rd Qu.: 3.900   3rd Qu.:0.3762   3rd Qu.:2.8   3rd Qu.: 5.100  
##  Max.   :11.700   Max.   :1.0000   Max.   :9.5   Max.   :18.300  
##                   NA's   :72                                     
##       X2P.             eFG.              FT             FTA        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   : 0.000  
##  1st Qu.:0.4507   1st Qu.:0.4640   1st Qu.:0.400   1st Qu.: 0.500  
##  Median :0.5160   Median :0.5170   Median :0.900   Median : 1.200  
##  Mean   :0.5055   Mean   :0.4975   Mean   :1.204   Mean   : 1.575  
##  3rd Qu.:0.5793   3rd Qu.:0.5630   3rd Qu.:1.600   3rd Qu.: 2.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :9.600   Max.   :11.800  
##  NA's   :28       NA's   :15                                       
##       FT.              ORB              DRB             TRB        
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.:0.6735   1st Qu.:0.3000   1st Qu.: 1.30   1st Qu.: 1.700  
##  Median :0.7650   Median :0.6000   Median : 2.30   Median : 2.900  
##  Mean   :0.7476   Mean   :0.8129   Mean   : 2.52   Mean   : 3.332  
##  3rd Qu.:0.8460   3rd Qu.:1.1000   3rd Qu.: 3.40   3rd Qu.: 4.400  
##  Max.   :1.0000   Max.   :4.6000   Max.   :11.00   Max.   :14.700  
##  NA's   :97                                                        
##       AST              STL              BLK              TOV        
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 0.500   1st Qu.:0.3000   1st Qu.:0.1000   1st Qu.:0.4000  
##  Median : 1.200   Median :0.5000   Median :0.3000   Median :0.8000  
##  Mean   : 1.808   Mean   :0.5828   Mean   :0.3536   Mean   :0.9787  
##  3rd Qu.: 2.400   3rd Qu.:0.9000   3rd Qu.:0.5000   3rd Qu.:1.3000  
##  Max.   :10.800   Max.   :2.5000   Max.   :2.8000   Max.   :4.8000  
##                                                                     
##        PF             PTS        
##  Min.   :0.000   Min.   : 0.000  
##  1st Qu.:1.000   1st Qu.: 3.300  
##  Median :1.600   Median : 6.300  
##  Mean   :1.565   Mean   : 7.812  
##  3rd Qu.:2.200   3rd Qu.:10.600  
##  Max.   :5.000   Max.   :30.600  
## 

We remove some variable as they are redundant:

  • PTS, because PTS = FT + 2 * X2P + 3 * X3P
  • TRB, because TRB = ORB + DRB
  • FG and FGA, because FG = X2P + X3P and FGA = X2PA + X3PA
  • eFG., because eFG. = (FG + 0.5 * X3P) / FGA = (X2P + 1.5 * X3P) / (X2PA + X3PA)
  • X2P. and X3P., because X2P. = X2P / X2PA and X3P. = X3P / X3PA

The decision about the final variables to consider (input) is key and will affect the output.

We are going to consider the following variables to measure skill performance:

  • X2P: 2-Point Field Goals Per Game
  • X2PA: 2-Point Field Goal Attempts Per Game
  • X3P: 3-Point Field Goals Per Game
  • X3PA: 3-Point Field Goal Attempts Per Game
  • FT: Free Throws Per Game
  • FTA: Free Throw Attempts Per Game
  • ORB: Offensive Rebounds Per Game
  • DRB: Defensive Rebounds Per Game
  • AST: Assists Per Game
  • STL: Steals Per Game
  • BLK: Blocks Per Game
  • TOV: Turnovers Per Game
  • PF: Personal Fouls Per Game
nba = nba[, c("X2P", "X2PA", "X3P", "X3PA", "FT", "FTA", "ORB", "DRB", "AST", "STL", "BLK", "TOV", "PF")]
dim(nba)
## [1] 812  13

Descriptive Analysis

Our input has dimension \(p = 13\), that implies \(2^p\) different relations between the variables.

Dimension 1: univariate analysis for all 13 variables

# Enter your code here to plot the boxplots and scaled boxplots

Dimension 2: bivariate analysis (scatter plots), in total 105

# Enter your code here to display a correlation graph

Conclusions?

Dimension 3: proportional to \(13^3\) relations, but no way to obtain information… This is why we need an analytical tool to reduce the dimension

Finally, note there are variables highly correlated, especially the most related ones (like ftm and fta)

PCA

From dimension 13 to dimension 2

pca = prcomp(nba, scale=T)
# pca = princomp(nba, cor=T) # the same, but using SVD instead of eigen decomposition 
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.7676 1.4620 0.96913 0.81706 0.62185 0.57505 0.53443
## Proportion of Variance 0.5892 0.1644 0.07225 0.05135 0.02975 0.02544 0.02197
## Cumulative Proportion  0.5892 0.7536 0.82586 0.87722 0.90696 0.93240 0.95437
##                            PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.50433 0.40255 0.34939 0.16989 0.12057 0.10645
## Proportion of Variance 0.01957 0.01246 0.00939 0.00222 0.00112 0.00087
## Cumulative Proportion  0.97394 0.98640 0.99579 0.99801 0.99913 1.00000

Insights?

This is the same, but using mathematical format; here, eigenvalues denote variances and eigenvectors denote loadings:

R = cor(nba)   # correlation matrix
eigen(R)  
## eigen() decomposition
## $values
##  [1] 7.65947992 2.13754026 0.93921182 0.66758529 0.38669376 0.33068023
##  [7] 0.28561635 0.25434919 0.16204491 0.12206997 0.02886108 0.01453617
## [13] 0.01133104
## 
## $vectors
##             [,1]        [,2]       [,3]        [,4]        [,5]        [,6]
##  [1,] -0.3283502  0.13090889 -0.2005587  0.04829601  0.15155669 -0.11847426
##  [2,] -0.3338110  0.05726485 -0.2149608  0.01997509  0.15079201 -0.10792035
##  [3,] -0.2165818 -0.44988703  0.3270124  0.34979052  0.09813958 -0.11673509
##  [4,] -0.2220291 -0.45601676  0.3028672  0.31876463  0.09909183 -0.11644052
##  [5,] -0.3151506 -0.02220766 -0.3602631  0.24802791 -0.26081680  0.22320970
##  [6,] -0.3176003  0.03410196 -0.3538379  0.23496798 -0.23251370  0.20947568
##  [7,] -0.1706977  0.51802347  0.1647788  0.03243788  0.43806350 -0.25224166
##  [8,] -0.3064642  0.21531566  0.1210756  0.08991934  0.20035959 -0.20723908
##  [9,] -0.2806262 -0.23411852 -0.1791940 -0.44990852 -0.00494676 -0.17663130
## [10,] -0.2571328 -0.13595733  0.2875929 -0.59275159 -0.26273293 -0.05941315
## [11,] -0.1984953  0.38398402  0.3809544  0.15411136 -0.68250377 -0.21239041
## [12,] -0.3202672 -0.11113691 -0.1299566 -0.24102592  0.08538040 -0.06084444
## [13,] -0.2728810  0.14063314  0.3758526 -0.10873256  0.19635650  0.81444139
##              [,7]         [,8]        [,9]        [,10]         [,11]
##  [1,]  0.06126662  0.544538719 -0.07174945 -0.035707018 -0.1376590598
##  [2,]  0.04303944  0.543542648 -0.01279845 -0.022388792  0.1185337555
##  [3,]  0.05344343 -0.016030111  0.12455926 -0.084641284 -0.6786485977
##  [4,]  0.00579295  0.008854199  0.07311835  0.021738245  0.7053768001
##  [5,]  0.18793768 -0.247888195  0.05616742 -0.068538871  0.0239410677
##  [6,]  0.21219982 -0.245950849  0.07334150  0.006354774  0.0028134146
##  [7,]  0.23485823 -0.294879836  0.51184162 -0.105619837  0.0468640306
##  [8,] -0.06206350 -0.336728258 -0.79512629  0.040176377 -0.0155598338
##  [9,] -0.39716825 -0.192994670  0.08975304 -0.628037364  0.0199684449
## [10,]  0.62508721  0.016156205 -0.06170887  0.117227255 -0.0009188301
## [11,] -0.32100339  0.131473087  0.10809410 -0.030908786  0.0078434368
## [12,] -0.40975843 -0.152064699  0.21801742  0.743916295 -0.0734002718
## [13,] -0.18089462  0.086639395 -0.00244457 -0.102661750  0.0033532861
##               [,12]        [,13]
##  [1,]  0.6509532697 -0.214704267
##  [2,] -0.6706093068  0.206338548
##  [3,] -0.1008515625  0.040220826
##  [4,]  0.1487045165 -0.020845168
##  [5,] -0.2051007099 -0.666161151
##  [6,]  0.2175727792  0.679210925
##  [7,] -0.0156227616 -0.033075890
##  [8,] -0.0497566704  0.009299392
##  [9,]  0.0419091939  0.032480254
## [10,]  0.0027552474 -0.014898604
## [11,] -0.0168137152  0.006779637
## [12,]  0.0093413656 -0.040742026
## [13,]  0.0009904054  0.003574908

How many components?

screeplot(pca,main="Screeplot",col="blue",type="barplot",pch=19)

Nicer with factoextra package:

fviz_screeplot(pca, addlabels = TRUE)

Note with 2 components we explain 75% of variability

Interpretation of components

First component:

barplot(pca$rotation[,1], las=2, col="darkblue")

Any hint for the meaning of the 1st PC?

Note the sum of the squared loadings (eigenvectors) is equal to 1

sum(pca$rotation[,1]^2)
## [1] 1

That means squared loadings are easier to interpret than the loadings

I.e. they are like percentages (numbers between 0 and 1)

So let’s plot squared loadings instead

They are called contribution of variables to components

fviz_contrib(pca, choice = "var", axes = 1)

The red dashed line on the graph above indicates the expected average contribution

If the contribution of the variables were uniform, the expected value would be 1/length(variables) = 1/15 = 6.6%

Now we can rank the players by their first PC scores: best historical players in terms of performance:

# The worst
names[order(pca$x[,1])][(length(names)-5):length(names)]
## [1] "Sam Dekker"      "Jaime Echenique" "Tyler Hall"      "DeJon Jarreau"  
## [5] "Juwan Morgan"    "Ade Murkey"
# The best
names[order(pca$x[,1])][1:10]
##  [1] "Joel Embiid"           "Giannis Antetokounmpo" "Nikola Jokić"         
##  [4] "Luka Dončić"           "James Harden"          "James Harden"         
##  [7] "LeBron James"          "Kevin Durant"          "James Harden"         
## [10] "Trae Young"

Nikola Jokić was the Most Valuable Player in the season: https://www.basketball-reference.com/leagues/NBA_2021_leaders.html

Second component:

barplot(pca$rotation[,2], las=2, col="darkblue")

Any insight about this component? Forward vs Guard?

Maybe we can get more insights by ranking the players using this component:

names[order(pca$x[,2])][1:6]
## [1] "Stephen Curry"    "Fred VanVleet"    "Trae Young"       "Damian Lillard"  
## [5] "Donovan Mitchell" "Darius Garland"
names[order(pca$x[,2])][(length(names)-5):length(names)]
## [1] "Anthony Davis"     "Clint Capela"      "Jakob Poeltl"     
## [4] "Mitchell Robinson" "Robert Williams"   "Rudy Gobert"

Contribution of variables to second component:

Take care because we loose the sign (to get contribution in percentage)

fviz_contrib(pca, choice = "var", axes = 2)

Once we have interpreted the meaning of the first two components, let’s see the contribution of each player to components

For the \(i\)-th player and first component, the contribution is: \(z_{1,i}^2 / \lambda_1 / n\), which is a number between 0 and 1

head(get_pca_ind(pca)$contrib[,1]) # this is in %, that is between 0 and 100
##          1          2          3          4          5          6 
## 0.02361101 0.09512845 0.69160618 0.04284392 0.03938898 0.01493083
head((pca$x[,1]^2)/(pca$sdev[1]^2))/dim(nba)[1] # this is between 0 and 1
## [1] 0.0002361101 0.0009512845 0.0069160618 0.0004284392 0.0003938898
## [6] 0.0001493083

Let’s visualize the top-100 players contributions to first component (global performance):

fviz_contrib(pca, choice = "ind", axes = 1, top=100)

Let’s see the first names:

names[order(get_pca_ind(pca)$contrib[,1],decreasing=T)][1:10]
##  [1] "Joel Embiid"           "Giannis Antetokounmpo" "Nikola Jokić"         
##  [4] "Luka Dončić"           "James Harden"          "James Harden"         
##  [7] "LeBron James"          "Kevin Durant"          "James Harden"         
## [10] "Trae Young"
# It is very similar to names[order(pca$x[,1])][1:10] but in percentage

Finally, let’s make a zoom to see the top-20 players in contributions:

names_z1 = names[order(get_pca_ind(pca)$contrib[,1],decreasing=T)]
fviz_contrib(pca, choice = "ind", axes = 1, top=20)+scale_x_discrete(labels=names_z1)

The Biplot

Biplot: observations and variables in same graph (using first 2 components)

biplot(pca)

Not informative in this case: too many players

Nicer and using contributions (instead of loadings), without players:

fviz_pca_var(pca, col.var = "contrib")

Nicer but again too much information:

fviz_pca_biplot(pca, repel = TRUE)

The Scores

Remember, for the \(j\)-th principal component: \(Z_j = X a_j\), \(a_j\) denotes the loadings, and \(Z_j\) denotes the scores

Let’s plot the first two scores, using colors for minutes played:

data.frame(z1=-pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label=names,color=min)) + geom_point(size=0) +
  labs(title="PCA", x="PC1", y="PC2") +
  theme_bw() + scale_color_gradient(low="lightblue", high="darkblue")+theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE) 

The first two PCs seem independent, but this is not always the same. What is true is that they are always uncorrelated.

The first component is highly correlated with minutes (decided by coaches)

The same, but using colors for games played:

data.frame(z1=-pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label=names,color=gp)) + geom_point(size=0) +
  labs(title="PCA", x="PC1", y="PC2") +
  theme_bw() + scale_color_gradient(low="yellow", high="darkred")+theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE) 

Insights?

Which are the teams with the best players?

data.frame(z1=-pca$x[,1],team=team) %>% group_by(team) %>% summarise(mean=mean(z1)) %>% arrange(desc(mean))
## # A tibble: 31 × 2
##    team   mean
##    <chr> <dbl>
##  1 POR   1.09 
##  2 OKC   0.735
##  3 BRK   0.728
##  4 HOU   0.610
##  5 IND   0.573
##  6 GSW   0.516
##  7 DET   0.356
##  8 NOP   0.330
##  9 ORL   0.323
## 10 LAL   0.321
## # … with 21 more rows

Portland Trail Blazers has the best overall players

Other view: are the better players playing more minutes in a game?

data.frame(z1=-pca$x[,1],z2=min) %>% 
  ggplot(aes(z1,z2,label=names,color=gp)) + geom_point(size=0) +
  labs(title="Performance", x="PC1", y="Minutes per game") +
  scale_color_gradient(low="lightblue", high="darkblue") +
  theme_bw() + theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)

Yes, but non-linear… why?

Are the better players playing more games?

data.frame(z1=-pca$x[,1],z2=gp) %>% 
  ggplot(aes(z1,z2,label=names,color=min)) + geom_point(size=0) +
  labs(title="Performance", x="PC1", y="Games played") +
  scale_color_gradient(low="lightblue", high="darkblue") +
  theme_bw() + theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)

Somehow, but not too much… why?